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MAXIMUM  LIKELIHOOD  ESTIMATION  OF  THE  PARAMETERS 
OF  A MULTIVARIATE  NORMAL  DISTRIBUTION 

by 

T.  W.  Anderson  and  I.  Olkin 
Stanford  University 

ABSTRACT 

This  paper  provides  an  exposition  of  several  alternative  techniques 
used  to  obtain  maximum  likelihood  estimators  for  the  parameters  of  a 
multivariate  normal  distribution.  In  particular,  matrix  differentiation, 
matrix  transformations  and  induction  are  treated.  These  techniques  are  used 
to  derive  the  maximum  likelihood  estimators  of  the  covariances  of  a Wishart 
distribution,  of  the  covariances  when  there  are  missing  observations, 
and  of  the  means  under  a rank  constraint.  Although  the  paper  is  mainly 
expository,  some  of  the  proofs  are  new. 


Key  words:  multivariate  normal  distribution;  Wishart  distribution; 

maximum  likelihood  estimation;  maximization  subject  to 
a rank  constraint;  matrix  transformations;  matrix 
differentiation;  multivariate  inequalities. 
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Maximum  Likelihood  Estimation  of  the  Parameters 


of  a Multivariate  Normal  Distribution 

1.  Introduction. 

Consider  the  maximum  likelihood  estimation  of  the  mean  u and 
covariance  matrix  I of  a normal  p-variate  distribution  based  on 

observations  x^ x^.  The  maximum  likelihood  estimator  of  the 

mean  is  the  sample  mean  x,  and  the  logarithm  of  the  concentrated 
likelihood  is  - | pN  log  2it  + -j  N times 

(1)  f(G)  - -log|G|  - tr  G_1V 
or 

(2)  g(H)  - log|H|  - tr  H V , 

where 

(3)  V - i z 

a-1 

is  positive  definite,  G represents  the  covariance  matrix  Z and  H 
the  inverse  I-*  . The  problem  is  to  maximize  (1)  with  respect  to 
positive  definite  G . Because  H is  a one-to-one  transformation  of 
G , it  is  equivalent  to  maximize  (2)  with  respect  to  positive  definite 
H . (This  is  Lemma  3.2.2  of  Anderson  (1958).) 

We  use  the  notation  A > 0 to  mean  that  the  symmetric  matrix  A 
is  positive  definite. 


2. 


We  shall  describe  three  alternative  approaches:  differentiation, 
matrix  transformations,  and  Induction,  each  of  which  can  be  applied 
to  f (G)  or  g(H).  It  Is  expected  that  these  techniques  will  be  useful 
in  other  problems  of  maximization.  Three  examples  are  given  to 
illustrate  the  ideas. 

Remark.  It  is  somewhat  surprising  that  an  early  reference  to  the  fact  that  the 
sample  covariance  matrix  is  the  maximum  likelihood  estimator  of  7 is 
elusive.  The  general  result  is  implicit  in  the  work  of  Wilks  (1932,  p.  476) 
dealing  with  likelihood  ratio  tests. 

2.  The  Method  of  Differentiation. 

The  functions  f(G)  and  g(H)  go  to  -»  as  G or  H approaches 
the  boundary  of  positive  definite  matrices  or  as  one  or  more  elements 
increases  without  bound.  From  the  facts  that  log|H|  is  concave  (see 
Bellman  (1970),  p.  128)  and  tr  H is  linear,  it  follows  that  g(H) 
is  concave  in  H > 0 , so  that  a maximum  exists,  and  it  is  unique.  The 

function  f(G)  is  neither  concave  nor  convex.  However,  since  there  is  ! 

i 

only  one  solution  to  the  derivative  equations,  these  equations  yield  the 
maximum. 

To  obtain  the  derivative  equations  we  use  the  differential  form 

I 1 

I 

G 

(3)  d[f  (G)  ]±J  - (2— 6±j ) {-  yij.  d8ij  + (tr  G_1  G_1  V)  dg^}  - 0 , i < j , j- 

where  6..  is  the  Kronecker  delta,  E. . is  a matrix  with  1 in  the  (i,j)th  | 

1J  l 

position  and  0 elsewhere,  and  G^  is  the  cofactor  of  gjj  . This  yields  !’ 

the  matrix  equation  ‘ 

’■i 

(4)  -G_1  + G”1  V G'1  • 0 , 

i 

i 

4 

^ i 

with  the  unique  solution  G * V . 


3. 


Similarly,  using  g(H), 


<5>  *(*<»>  ly  - dhy  - (tr  EtJ  V)  dhj  - 0 , t < j , 


which  yields  the  equation 


H"1  - V - o , 


with  the  unique  solution  H - V_1 

In  essence,  this  approach  applied  to  f(G)  has  been  discussed  by 
Smith  (1978);  the  approach  applied  to  g(H)  was  used  by  Anderson  (1958) 
For  a more  detailed  discussion  of  differentials  see  Deeaer  and  Olkln 
(1951),  or  Anderson  (1958)  p.  310. 

3*  The  Method  of  Matrix  Transformations. 

The  functions  f(G)  and  g(H)  can  be  written  in  canonical  forms. 
For  any  matrix  C such  that  CC'  - V,  let 


- C^GC'”1  , H 


C'HC  . 


Then  (1)  and  (2)  yield 


f (G)  - log  | V|  - - log  |g|  - tr  G_1  , 
g(H)  - log  |v|  - log|ii|  - tr  H . 


Hence  we  shall  take  V - I and  drop  the  tilde. 


4 


' We  use  three  well-known  representations  for  a positive  definite 

( matrix,  namely, 

! 

f (i)  H - TT’  , 

; where  T is  upper  (or  lower)  triangular; 

(ii)  H - D R D , 

f 

i 

where  D - diag(*H^,  ...»  , and  R - (r^)  is  a correlation 

\ 

! matrix,  (i.e.,  r^  • 1); 

(in)  h ■ r od  r1  , 

£ 

| where  T is  orthogonal,  « diag(dlt  ....  dp)  and  d^  dp 

are  the  characteriatic  roots  of  H. 

In  each  case  the  problem  reduces  to 

» 

(6)  Max  (log  z - z). 

z>0 

The  function  log  z - z is  concave  and  has  the  unique  maximum  of  -1 
at  z ■ 1. 

3.1.  Transformation  to  Rectangular  Coordinates:  H - TT'  . Now  the 
maximization  of  g(H)  becomes 


i 1 


(7) 


Max 

Cii’ tij 

i<J 


(log 

1 


'ii 


- t 


ii 


- I 

i<J 


'2y> 


Clearly,  the  maximum  over  t^,  i < j occurs  at  t^  " 0,  so  that 
(7)  reduces  to  a sum  of  terms  like  (6) . 
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3.2.  Transformation  to  a Correlation  Matrix:  H ■ D R 0 , Now  the 


maximization  of  g(H)  becomes 


Max  { l (log  hli  - hlt>  + log  ML 


hii>0  1 


Since  R is  a correlation  matrix,  by  Hadamard's  determinant  inequality 


|r|  <_  Hr^  * 1*  with  equality  for  R ■»  1.  Alternatively,  we  can  write 

R - uu',  where  U is  upper  triangular  with  £ u^  * 1;  consequently, 

j-1 

2 

|r|  - n u^  1.  Now  (8)  reduces  to  a sum  of  terms  like  (6). 


3.3.  Transformation  to  Characteristic  Roots:  H mT  I*'  . 
mization  of  g(H)  becomes 


Now  the  maxi- 


Max  {£  (log  d1  - di)}. 


di>0  i 


which  is  of  the  fora  (6).  The  transformation  3.3  has  been  suggested  by 
Anderson  (1958)  Problem  4,  Chapter  3,  and  has  been  used  by  Watson  (1964); 
it  is  the  essence  of  the  method  of  Khatrl  (1979)  and  Tamhane  (1979). 

4.  The  Method  of  Induction. 


Write 


H11  H12 


H21  h22 


, Hu:  p - 1 x p 


then  the  maximization  of  g(H)  becomes 


5.  Other  Methods. 

If  f (G)  and  g(H)  are  not  put  Into  canonical  form  several  other 
methods  may  be  employed.  There  is  a nonsingular  matrix  Q such  that 


V - QQ\  G - QOxQ*  , 

where  Is  a diagonal  matrix  with  diagonal  elements  the  roots  of 

|G  - AV|  - 0.  Then 

f (G)  - - 2 log | Q|  - f (log  A.  + . 

1-1  1 

which  is  maximised  for  ^ - 1;  that  Is,  6^  « I,  G - V . 

Another  method  invokes  a theorem  of  von  Neumann  (1937) : 

,rHV- X Vp-i+i 


• It  I rrtlliMlliiiaiiSiMai’hl'-l'ir'-'gt 


i 
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where  n,>  . ..>  0 are  the  ordered  characteristic  roots  of  H and 
1 P 

v^>  ...>  are  the  ordered  characteristic  roots  of  B;  equality  is 
attained  when  the  characteristic  vectors  of  H are  identical  to  those 
of  V We  have 


*(H)  1 j1  (1°8  \ - niVl+X> 

with  equality  when  the  characteristic  vectors  of  H are  identical  to 

those  of  V.  Then  (log  Vp«i+iH8  maximized  at  ^1  " 1/vp-i+l  * 

i » 1,  ...»  p.  Thus  g(H)  is  maximized  at  H » V ^ . This  method  is 
used  by  Theobald  (1975). 

f>.  A Modified  Model  with  Missing  or  Additional  Observations. 

Suppose  a sample  of  size  N is  observed  from  a p-variate  normal 
distribution  with  covariance  matrix  E,  and  an  additional  sample  of  size 
M is  observed  on  the  first  k (out  of  p)  variates.  Alternatively, 
this  model  can  be  viewed  as  a sample  of  size  tf+M  from  a p-variate  normal 
distribution,  where  M observations  on  the  last  p-k  (out  of  p)  variates 
are  missing.  This  problem  was  considered  by  Anderson  (1957)  and  in 
another  context  by  Olkin  and  Sylvan  (1977). 

The  problem  now  is  to 

(9)  Max  {-  N log | G|  - tr  G_1V  - M logjfijJ  - tr  g’Jw}  , 

G>0 

(G11  Gi ?\ 

„ „ 1,  G.,:  k x k,  V:  p x p and  W:  k x k are  positive 

®21  °22  11 


■i 

i 


definite 
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6.1.  Reduction  to  a Canonical  Fora.  In  terns  of  H ■ G~  , the  problem 
becomes 


<10)  l08^‘  tr  H V + M 1o»Ihu  - H12H22H21 1 ' tr(Hll  “ H12H22H21)W} 

■ (to  {»  1o*|H22|  + (»  + M)  log|Hu  - H12H^21|  - tr  euVn  - tr  H12v21 


tr  H21V12  - tr  H22v22  - “ H12H22H21)  * 


6.2.  Transformation. 

One  can  think  of  this  likelihood  as  composed  of  the  marginal  like- 
lihood  of  the  N + M observations  on  the  k variables  and  the  conditional 
likelihood  function  of  the  N observations  on  the  p - k variables 
conditional  on  the  N observations  on  the  k variables.  The  appropriate 
parameters  are 


H11  - • J-  H22H21  ■ *•  H22  • H22  • 


(If  H - G"1,  then  J - G“J,  K - - g2iG11»  H22  “ (G22_G21GilG12)1') 

The  corresponding  transformation  of  the  covariance  matrix  of  N observations 


V22  * V21VUV12  * P'  V21VU  ■ '•  VU  ’ ’ll  ’ 


'll  'll 


I 
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Then  (10)  Is  transformed  to 

(11)  Max  (N  log  H__  - tr  H..F  + (N+M)  log|j|  - tr  J(V  + W) 

h22>o,j>o,k  " 11 

- tr  H22(K+E)V11(K+E)'}. 

The  last  term  is  nonpositive  because  H22  and  are  positive  definite; 

/s 

its  maximum  of  0 occurs  at  K ■ - E ■ - ^21^11  * Then  O-D  becomes  the 
sum  of  two  terms  like  (2).  The  maximum  over  H22  > 0 and  J > 0 occurs  at 
H22  ” NT'1  and  J - (N4M) (V11+W)-1  . 

It  might  be  noted  that  this  last  analysis  is  applicable  to  the 
following  problem.  Given  a matrix  X with  regression  KZ  and  covariance 
matrix  H22^’  t*ie  likelihood  function  is  (11)  with  J » 0, 

E - X’Z(Z'Z)-1  and  F - X'X  - EZ’ZE'  . 


6.3.  Differentiation. 

We  now  solve  (9)  by  differentiation;  this  alternative  may  have  some 
intrinsic  interest.  Using  differential  forms,  we  obtain 


(2-V 


1% 

\W 1 


gij+(tr  G~1EijG~1V)dgij 


V • 


0, 


i j , where  A = G^  and  A*j  is  the  cofactor  of  a^  in  A.  This 


yields 

the  matrix  equation 

°ii  °1 

/°u"°u 

i 

(12) 

-NG_1  + G‘1VG_1  - M 

L 0 °i 

\ 0 0 J 

= 0 . 
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Pre-  and  post-multiplication  by  G in  (12)  yields 


/ g7* 

0 / G~* 

W 

0 

1 11 

/ 11 

11 

(13) 

-NG  + V -MG 

G + G 

G = <-  , 

\ 0 

0 l 

0 

0 

(15a>  £u  ’ 5S  <"11  + ">  • 

<15b)  612  ■ 55?  <v+u)vul"i2  • 


<!5c:)  G22  - N V22  + (u+m)  v21vll  ""ll^"l2  " N(N+M)  v21v11  v12  * 


7.  A Problem  of  Rank. 

Let  the  N columns  of  X and  the  M columns  of  Y be  indepen- 
dently distributed  according  to  a p-variate  normal  distribution  with 
covariance  matrix  £ and  EX  ■ 0,  EY  ■ where  <f>  is  of  rank 

r <_  p < M . This  model  is  considered  by  Anderson  (1951). 


I 


To  obtain  the  maxima  likelihood  estimator8  of  T.  and  4 we 


start  with  the  likelihood  function 


(16) 


c exp{-  j tr  E_1[X  X'  + (Y-4) (Y-4) ’ ] } , 


where  c is  a normalizing  constant.  From  Sections  1,  2,  and  3,  the 
maximum  of  (16)  with  respect  to  E (for  fixed  4)  is 


9 - [X  X'+  (Y  - 4)  (Y  - 4>y]/(N+M)  , 


so  that  we  need  to  determine 


(17)  Min  ll)  - Min  |x  x*+  (Y  - 4)  (Y  - 4)1 

$ 4 

- Min  Jx  X * 1 1 1 + (X  X,)Ji(Y-*)(Y -*4)'(X  X') 
4 


To  simplify  notation,  write 


(18)  Y - (X  X’)"**  Y , 4 - (X  X')"**  4 , 

then  (17)  becomes  (except  for  the  term  |x  X* | ) 


(19) 


Min  |l  + (Y  - 4)  (Y  - 4)1  . 
4 


(20) 


Since  4:  p x M is  of  rank  r,  we  can  write 

♦ • <Ti  T*>  nl(M  “ T.A.  , 


12. 


1 I 


t! 


: r x M and 


/M 

where  (T.  T,)s  p x p,  T.:  px  r;  4 • . ! M*  M,  A, 

' 2 - - /Tn| 

A Is  orthogonal.  In  the  representation  (20),  Q If  ■ j ^ j , 


where  T^s  r x r Is  lower  triangular  with  positive  diagonal  elements. 


then  the  representation  * ■ A^  la  unique. 
In  (20)  note  that 


(21) 


1 1 + (Y  - <J>)  ( Y - 4>)'| 

P 


i t I 


I + Y Y'+  - TjAjY*-  Y A,^ 


|(I  + Y Y’-  YAJA1Y*)+(T]l  - YAj)(T1  - YA^)'| 


- | I + Y aJa2Y'+  (Tx  - YAj)^  - YA*)'|  . 


Since  I + Y Is  positive  definite,  the  minimum  of  (21)  over 


Is  achieved  at  ■ YA^,  in  which  case  we  need  to  determine 


(22) 


Min  1 1 + Y a'a2Y*|-  Min  1^  + AjY'Y  A’| 

a2  a2 


- Min  lAjO^  + Y'Y)A2| 
A2 


The  minimum  of  (22)  la  X^  , where  X^  > ...  > X^ 

— ** 

are  the  ordered  characteristic  roots  of  I + Y'Y  ■ I + Y\XX*)  Y , that 
is,  the  roots  of  |(X*X  + Y'Y)  - XX'X)  - 0 . (See,  e.g..  Bellman  (1970), 


X 

t, 


The  model  (16)  is  considered  by  Healy  (1979).  His  procedure  is  to 
first  make  a series  of  transformations  motivated  by  the  rank  condition 
on  ♦ , which  transforms  the  model  to  a canonical  form,  and  then  to 
carry  out  the  maximizations.  However,  by  first  maximizing  with  respect 
to  the  covariances  (as  in  the  above  derivation),  the  required  transforma- 
tions (and  proof)  become  considerably  simpler. 
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